Image Data Exploration and Dimensionality Reduction on RSNA X-Ray Data

by: Carter Koehler

1. Understanding the Data

The primary purpose of this dataset is to determine a person's age based on their bone structure examined in an x-ray. The real value of this is determining how mature a person's skeleton is to determine if the patient has any perceptible growth defects, especially for children. For example, if a person's skeleton indicates that they should be 7 years old, but they are in fact 10, they are likely suffering from some sort of endocrine disorder. However, there are not many good quantitative measures for this, even fewer of which are fast and fully automated.

If a prediction algorithm were able to compute the age of a person's skeleton based on their x-ray data, that is enough to determine whether they are suffering from endocrine dysfunction.

This test does not need to be perfect, but it should do a reasonable job at predicting the presence or absence of endocrine dysfuntion. In particular, this form of testing should serve as a basic, first-pass screen, so that hospitals can save money on doing more costly, but more precise, tests on each individual. As a basic screening technique, when predicting whether or not someone's bones are inside or outside of the tolerable range of their age, we should prioritze decreasing the false negative rate to decreasing the false positive rate. It is always easier to run a patient through extra tests to find out that they aren't affected than it is to turn away a patient who is affected. The false negative rate should be very low, too, perhaps 2%.

As for the actual acceptable range, we would like to predict someone's bone age to within a year of what its actual value. This is a rough estimate, as there isn't too much readily available information on the link between bone age and endocrine dysfunction. When Iglovikov et al. tried to solve a similar problem, they were able to accurately predict boneage to within around 6 months, so getting to within one year using somewhat less sophisticated methods seems a reasonable goal. In the dataset, age is given in months (all of the data are of children under 20), so we will attempt to get that measure within 12.

2. Preparing the Data

2.1 Reading in the Data

In [1]:
import numpy as np
import scipy as sci
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

import os
import sys

from matplotlib import rcParams

rcParams['figure.figsize'] = 8,7
In [2]:
file_dir = os.path.join(os.pardir, 'data')

raw_data = pd.read_csv(os.path.join(file_dir, 'boneage.csv'))

label_data = raw_data.iloc[:2000]

label_data.head()
Out[2]:
id boneage male
0 1377 180 False
1 1378 12 False
2 1379 94 False
3 1380 120 True
4 1381 82 False
In [3]:
# HDD's are slow

image_data = pd.DataFrame(columns=range(65536))

The dataset we are using here is very large, 9.1 GB at its fullest. So, we will do some subsampling and some compression. Much of the compression we will do here is offline and loaded in later.

The eventual form of the data will be about 2000, 256x256 grayscale images, unravelled and placed in a Pandas Dataframe.

In [4]:
compressed_dir = os.path.join(file_dir, 'boneage-compressed')

Staring here, do not run the notebook cells. They are meant to be run once and then recovered later.

In [5]:
# from PIL import Image

# # Do not run this. Only to be run once to compress images

# image_dir = os.path.join(file_dir, 'boneage-training-dataset')

# for i in label_data.id: #list(label_data.id):
#     file_name = str(i) + '.png'
#     temp_image = Image.open(os.path.join(image_dir, file_name))
    
#     # resize image to close to average (computed offline)

#     temp_image = temp_image.resize((256,256))
#     temp_image.save(os.path.join(compressed_dir, str(i) + '-compressed.png'))
    
# #     pix_vals = np.array(temp_image.resize((256,256))).flatten()
# #     image_data = pd.concat([image_data, pd.DataFrame(pix_vals).T])
    
#     print(file_name + ' has been added')

# for i in label_data.id:
#     file_name = os.path.join(compressed_dir, str(i) + '-compressed.png')
    
#     current_image = Image.open(file_name)
#     pix_vals = np.array(current_image).flatten()
    
#     image_data = pd.concat([image_data, pd.DataFrame(pix_vals).T])

#     if i%100 == 0:
#         print(file_name + ' has been added', image_data.shape)

# import pickle

# pickle.dump(image_data, open(os.path.join(compressed_dir, 'image_data.p'), 'wb'))

You may start running these cells again.

In [5]:
import pickle

image_data = pickle.load(open(os.path.join(compressed_dir, 'image_data.p'), 'rb')).astype(float)

image_data.index = range(0,image_data.shape[0])
In [6]:
# image_data = image_data.astype(float)

image_data
Out[6]:
0 1 2 3 4 5 6 7 8 9 ... 65526 65527 65528 65529 65530 65531 65532 65533 65534 65535
0 234.0 235.0 200.0 148.0 116.0 91.0 63.0 60.0 61.0 54.0 ... 46.0 47.0 50.0 47.0 42.0 46.0 49.0 48.0 48.0 48.0
1 130.0 128.0 130.0 128.0 128.0 129.0 130.0 127.0 128.0 128.0 ... 131.0 127.0 128.0 128.0 130.0 130.0 128.0 130.0 131.0 128.0
2 130.0 129.0 127.0 126.0 126.0 126.0 127.0 128.0 127.0 128.0 ... 127.0 126.0 130.0 126.0 127.0 128.0 127.0 127.0 126.0 127.0
3 61.0 62.0 60.0 62.0 61.0 57.0 61.0 61.0 57.0 57.0 ... 34.0 36.0 32.0 35.0 34.0 31.0 33.0 35.0 33.0 31.0
4 129.0 129.0 127.0 129.0 128.0 127.0 126.0 127.0 126.0 129.0 ... 129.0 129.0 127.0 127.0 128.0 127.0 127.0 127.0 126.0 128.0
5 101.0 101.0 99.0 96.0 98.0 96.0 100.0 95.0 99.0 96.0 ... 62.0 62.0 65.0 64.0 62.0 61.0 62.0 64.0 63.0 63.0
6 115.0 114.0 113.0 111.0 112.0 114.0 111.0 113.0 111.0 112.0 ... 108.0 109.0 107.0 107.0 108.0 107.0 107.0 108.0 107.0 108.0
7 201.0 190.0 167.0 145.0 128.0 116.0 112.0 111.0 113.0 112.0 ... 107.0 107.0 107.0 107.0 108.0 108.0 108.0 110.0 109.0 108.0
8 16.0 14.0 18.0 15.0 16.0 11.0 17.0 19.0 11.0 10.0 ... 19.0 20.0 20.0 23.0 19.0 22.0 15.0 21.0 22.0 17.0
9 126.0 125.0 124.0 123.0 119.0 125.0 124.0 125.0 120.0 120.0 ... 134.0 119.0 106.0 92.0 82.0 67.0 49.0 48.0 46.0 44.0
10 23.0 22.0 18.0 19.0 23.0 20.0 17.0 16.0 15.0 15.0 ... 42.0 45.0 43.0 41.0 44.0 46.0 40.0 43.0 44.0 45.0
11 42.0 39.0 39.0 39.0 40.0 41.0 38.0 48.0 67.0 89.0 ... 29.0 30.0 30.0 28.0 30.0 33.0 29.0 32.0 31.0 30.0
12 175.0 178.0 175.0 179.0 180.0 177.0 177.0 175.0 176.0 180.0 ... 117.0 118.0 119.0 120.0 119.0 117.0 118.0 119.0 119.0 119.0
13 82.0 81.0 81.0 81.0 80.0 78.0 79.0 80.0 79.0 79.0 ... 74.0 74.0 77.0 77.0 73.0 75.0 78.0 73.0 75.0 74.0
14 136.0 136.0 135.0 134.0 136.0 134.0 134.0 135.0 135.0 133.0 ... 114.0 113.0 113.0 113.0 113.0 114.0 113.0 113.0 115.0 114.0
15 121.0 121.0 122.0 122.0 120.0 123.0 122.0 123.0 122.0 121.0 ... 122.0 124.0 122.0 123.0 124.0 122.0 124.0 122.0 123.0 122.0
16 160.0 158.0 159.0 161.0 172.0 183.0 183.0 181.0 177.0 176.0 ... 224.0 223.0 226.0 224.0 226.0 223.0 224.0 223.0 225.0 226.0
17 118.0 117.0 117.0 116.0 115.0 116.0 117.0 116.0 117.0 116.0 ... 118.0 118.0 118.0 121.0 118.0 120.0 119.0 120.0 119.0 120.0
18 127.0 129.0 126.0 127.0 128.0 128.0 129.0 131.0 129.0 126.0 ... 129.0 128.0 130.0 128.0 128.0 130.0 127.0 128.0 129.0 128.0
19 18.0 14.0 17.0 16.0 17.0 17.0 20.0 16.0 15.0 12.0 ... 21.0 21.0 16.0 16.0 20.0 17.0 17.0 22.0 23.0 18.0
20 16.0 18.0 12.0 14.0 12.0 14.0 13.0 10.0 12.0 15.0 ... 238.0 236.0 233.0 239.0 236.0 234.0 236.0 236.0 239.0 237.0
21 127.0 128.0 128.0 130.0 129.0 127.0 128.0 130.0 127.0 128.0 ... 128.0 128.0 128.0 128.0 129.0 128.0 129.0 127.0 129.0 126.0
22 128.0 127.0 127.0 125.0 127.0 127.0 126.0 125.0 125.0 124.0 ... 126.0 126.0 126.0 130.0 126.0 129.0 126.0 125.0 126.0 127.0
23 163.0 163.0 163.0 163.0 162.0 163.0 162.0 162.0 160.0 162.0 ... 120.0 121.0 120.0 121.0 121.0 120.0 119.0 120.0 120.0 120.0
24 140.0 147.0 156.0 163.0 172.0 180.0 190.0 198.0 205.0 205.0 ... 109.0 110.0 109.0 109.0 110.0 109.0 111.0 109.0 111.0 109.0
25 233.0 230.0 227.0 230.0 235.0 215.0 167.0 126.0 94.0 68.0 ... 208.0 228.0 229.0 223.0 226.0 223.0 224.0 219.0 222.0 216.0
26 128.0 127.0 128.0 126.0 129.0 127.0 128.0 126.0 127.0 129.0 ... 125.0 125.0 126.0 126.0 126.0 125.0 127.0 125.0 125.0 126.0
27 102.0 102.0 100.0 101.0 100.0 99.0 101.0 101.0 99.0 99.0 ... 106.0 106.0 106.0 105.0 107.0 106.0 105.0 106.0 107.0 106.0
28 197.0 165.0 149.0 134.0 133.0 135.0 135.0 135.0 134.0 134.0 ... 125.0 126.0 125.0 124.0 125.0 127.0 126.0 134.0 148.0 169.0
29 62.0 60.0 67.0 66.0 64.0 59.0 62.0 62.0 64.0 59.0 ... 58.0 64.0 62.0 61.0 65.0 60.0 64.0 62.0 57.0 61.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1970 27.0 26.0 27.0 27.0 26.0 27.0 26.0 27.0 28.0 27.0 ... 26.0 28.0 26.0 28.0 27.0 27.0 27.0 27.0 27.0 27.0
1971 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1972 56.0 56.0 56.0 56.0 55.0 55.0 55.0 55.0 55.0 56.0 ... 54.0 55.0 54.0 54.0 54.0 55.0 55.0 54.0 55.0 55.0
1973 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 ... 60.0 61.0 61.0 61.0 60.0 62.0 61.0 62.0 63.0 63.0
1974 0.0 0.0 0.0 0.0 0.0 67.0 66.0 67.0 66.0 65.0 ... 34.0 34.0 36.0 43.0 55.0 38.0 0.0 0.0 0.0 0.0
1975 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 ... 58.0 58.0 57.0 59.0 59.0 58.0 59.0 59.0 59.0 63.0
1976 120.0 119.0 119.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 ... 43.0 44.0 44.0 44.0 44.0 44.0 44.0 46.0 46.0 44.0
1977 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 255.0 ... 52.0 52.0 51.0 51.0 51.0 50.0 52.0 52.0 52.0 51.0
1978 68.0 69.0 68.0 69.0 69.0 69.0 67.0 66.0 69.0 66.0 ... 41.0 41.0 40.0 40.0 40.0 39.0 41.0 48.0 49.0 89.0
1979 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1980 4.0 7.0 5.0 5.0 5.0 6.0 7.0 5.0 7.0 7.0 ... 5.0 7.0 5.0 7.0 6.0 4.0 5.0 8.0 7.0 5.0
1981 67.0 70.0 69.0 69.0 68.0 66.0 69.0 65.0 68.0 69.0 ... 69.0 68.0 67.0 69.0 67.0 69.0 68.0 69.0 69.0 68.0
1982 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1983 56.0 57.0 60.0 59.0 56.0 58.0 58.0 55.0 57.0 56.0 ... 68.0 71.0 72.0 75.0 75.0 74.0 73.0 74.0 70.0 81.0
1984 144.0 132.0 125.0 118.0 117.0 122.0 123.0 123.0 124.0 122.0 ... 71.0 72.0 73.0 72.0 71.0 72.0 72.0 71.0 64.0 82.0
1985 28.0 28.0 27.0 25.0 27.0 26.0 27.0 26.0 27.0 26.0 ... 27.0 27.0 27.0 25.0 26.0 24.0 29.0 26.0 26.0 27.0
1986 0.0 0.0 0.0 0.0 0.0 86.0 86.0 86.0 86.0 86.0 ... 57.0 57.0 56.0 57.0 58.0 62.0 0.0 0.0 0.0 0.0
1987 9.0 8.0 8.0 9.0 11.0 7.0 7.0 8.0 9.0 7.0 ... 8.0 9.0 9.0 7.0 6.0 10.0 9.0 8.0 9.0 8.0
1988 85.0 91.0 89.0 89.0 86.0 92.0 99.0 91.0 88.0 89.0 ... 56.0 57.0 57.0 57.0 58.0 57.0 58.0 59.0 60.0 60.0
1989 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1990 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1991 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1992 5.0 6.0 7.0 5.0 7.0 4.0 7.0 4.0 5.0 6.0 ... 6.0 6.0 7.0 6.0 6.0 4.0 7.0 6.0 7.0 6.0
1993 10.0 9.0 10.0 11.0 9.0 8.0 10.0 11.0 9.0 9.0 ... 9.0 11.0 9.0 9.0 9.0 10.0 9.0 9.0 12.0 10.0
1994 117.0 126.0 128.0 128.0 128.0 118.0 119.0 113.0 113.0 113.0 ... 66.0 66.0 66.0 67.0 68.0 68.0 67.0 67.0 64.0 67.0
1995 66.0 65.0 63.0 68.0 68.0 69.0 66.0 64.0 68.0 66.0 ... 62.0 61.0 61.0 63.0 62.0 64.0 63.0 63.0 63.0 63.0
1996 12.0 11.0 12.0 12.0 10.0 12.0 10.0 8.0 10.0 12.0 ... 10.0 8.0 10.0 8.0 8.0 6.0 7.0 8.0 7.0 7.0
1997 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1999 9.0 8.0 8.0 7.0 8.0 8.0 9.0 8.0 7.0 7.0 ... 8.0 8.0 9.0 6.0 8.0 7.0 7.0 6.0 8.0 7.0

2000 rows × 65536 columns

1.2 Visualizing Some Images

In [7]:
from skimage.io import imshow

# shamelessly stolen from https://github.com/eclarson/MachineLearningNotebooks/blob/master/04.%20Dimension%20Reduction%20and%20Images.ipynb

def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        
        try:
            plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        except:
            plt.imshow(images.iloc[i].reshape((h, w)), cmap=plt.cm.gray)            
            
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
        
labels = ['ID: ' + str(label_data.id[i]) + ' age: ' + str(label_data.boneage[i]) for i in range(label_data.shape[0])]

plot_gallery(image_data, labels, 256, 256)
/usr/lib/python3.5/site-packages/ipykernel_launcher.py:13: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  del sys.path[0]
/usr/lib/python3.5/site-packages/ipykernel_launcher.py:15: FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use .values.reshape(...) instead
  from ipykernel import kernelapp as app

Fortunately, all of these images are in grayscale, so we don't have to deal with converting RGB values to grayscale or any such other nonsense. However, the data is a bit messy with regards to the actual x-ray pictures. Many of the images are off-center or badly scaled or rotated such that they don't match up with the rest of the images in the dataset.

As we don't really have a good automated way of dealing with these issues (doing so would require us to already have a classifier for the data), we will simply proceed and see what we can make of the data that we have. However, the methods used here are somewhat simplistic, so this may cause us some problems.

3. Reducing the Data

3.1 Linear Dimenionality Reduction using PCA and Randomized PCA

And now, the fun part.

In [11]:
from sklearn.decomposition import PCA

# fit a linear Principal Components Analysis model to the data

pca = PCA(n_components=500)
%time pca.fit(image_data)
CPU times: user 5min 27s, sys: 39.8 s, total: 6min 7s
Wall time: 1min 43s
Out[11]:
PCA(copy=True, iterated_power='auto', n_components=500, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
In [12]:
eigenhands = pca.components_
eigenlabels = ['eigenhand ' + str(i) for i in range(eigenhands.shape[0])]

plot_gallery(eigenhands, eigenlabels, 256, 256)

So, shockingly enough, the eigenhands that our principal components give us look somewhat like hands. For the sake of curiosity, let's see if this holds for much higher-value eigenhands. We pick eigenhand 387 to start with for fun.

In [13]:
plot_gallery(eigenhands[387:], eigenlabels[387:], 256, 256)

The surprises don't stop. Despite the first few eigenhands looking for hands, as the variance explained by the eigenhand drops, so does the actual resemblance to a hand. Fitting, as we should expect the decreasing relevance to the actual shape of the hand to be correlated to less explained variance in the data, as the presence of a seemingly-random set of gray pixels in the frame shouldn't have much to do with the actual object contained in the data.

In [14]:
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
    
plot_explained_variance(pca)

Excellent! With just 500 principal components (down from the 66,000 from earlier), we are able to capture 97% of the variance in a given image. We will keep all of these principal components. For a less detail-oriented task, we would expect less percentage of explained variance, say 95%, to be sufficient. However, here, we would expect the difference between the bone structure of a given person and the bone structure of someone else who is of similar, but not quite the same, age to be very nuanced, so we will leave in all 500 of the principal components we have computed here.

In [17]:
from skimage.measure import compare_ssim as ssim
import warnings

warnings.filterwarnings("ignore", category=FutureWarning) 

def plot_reconstruction(data, idx, pca, display=True):
    
    '''Plots an original image from the dataset 
    alongside its reconstruction after being passed 
    through PCA model'''
    
    # transform and reconstruct an image from dataset
    orig_img = np.array(data.iloc[idx])
    trans_img = pca.transform(orig_img.reshape(1,-1))
    restruct_img = pca.inverse_transform(trans_img).reshape(65536)

    # SSIM for determining similarity between images
    ssim_restruct = ssim(orig_img, restruct_img)
    
    if display:
        fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
        ax1.imshow(orig_img.reshape((256,256)), cmap=plt.cm.gray)
        ax1.set_title('Original Image')
        ax1.set_xticks(())
        ax1.set_yticks(())
        
        ax2.imshow(restruct_img.reshape((256,256)), cmap=plt.cm.gray)
        ax2.set_title('Reconstructed Image')
        ax2.set_xticks(())
        ax2.set_yticks(())
        
        plt.show()

    return ssim_restruct

idx = np.random.choice(image_data.shape[0])
ssim_recon = plot_reconstruction(image_data, idx, pca)

print('SSIM:{:.2f}'.format(ssim_recon))
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-17-0dd13017b5e7> in <module>()
     35 
     36 idx = np.random.choice(image_data.shape[0])
---> 37 ssim_recon = plot_reconstruction(image_data, idx, pca)
     38 
     39 print('SSIM:{:.2f}'.format(ssim_recon))

NameError: name 'pca' is not defined

Reconstruction by PCA appears to work very well. All of the reconstructed images here are very similar in structure to their original counterparts, albeit with some added noise and sometimes shifted colors. SSIM was chosen to serve as a metric for reconstruction fidelity as our prediction task is much more concerned with the structure of the x-ray data (which SSIM was designed to capture) than random noise in the background which would get picked up by something like PSNR or MSE. Additionally, some basic heuristic tests indicate that, of those three, SSIM seems to perform the best at determining which images are similar. PCA reconstruction achieves reasonable scores with SSIM. This will be compared with other methods of reconstruction later on.

Now, we shall move on to Randomized PCA, which promises to fit in less time than standard PCA by using a matrix of lower rank.

In [16]:
from sklearn.decomposition import RandomizedPCA

rpca = RandomizedPCA(n_components=500)
%time rpca.fit(image_data)

r_eigenhands = rpca.components_
plot_gallery(r_eigenhands, eigenlabels, 256, 256)
/usr/lib64/python3.5/site-packages/sklearn/utils/deprecation.py:58: DeprecationWarning:

Class RandomizedPCA is deprecated; RandomizedPCA was deprecated in 0.18 and will be removed in 0.20. Use PCA(svd_solver='randomized') instead. The new implementation DOES NOT store whiten ``components_``. Apply transform to get them.

CPU times: user 3min 20s, sys: 24.7 s, total: 3min 45s
Wall time: 1min 2s

If we look back at the first 18 eigenhands from the standard, linear method, we see that our new, randomized eigenhands are nearly identical to the old ones. In fact, we can visualize this difference.

In [17]:
plot_gallery(r_eigenhands - eigenhands, eigenlabels, 256, 256)

The difference between the old eigenhands and the new eigenhands results in low-value, gray noise, which seems to indicate that most of the important data has been captured. It may cause us to lose a little bit of explained variance as we look to higher-order eigenhands, but we appear to have most of the information that we could need, and RPCA took over a minute less to run.

One interesting pattern that can be seen here and in the previous examinations of eigenhands is that much of the variance appears to happen in our corners around the center of the image. This is likely due to the inconsistent framing of the actual x-ray images. These "corners" are likely where their corners tend to fall.

To check our previous comment about explained variance, let's look at the actual explained variance by the RPCA eigenhands.

In [18]:
plot_explained_variance(rpca)

We once again have explained 98% of the variance in our data with only the first 500 principal components. We can conclude that there is almost no difference between running PCA and RPCA.

A more direct visual comparison will come later, once we can compare both of these methods and the kernel method.

3.2 Non-Linear Dimensionality Reduction using KPCA

In [19]:
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=500, kernel='rbf',
                 gamma=10, fit_inverse_transform=True)
%time kpca.fit(image_data)
CPU times: user 2min 48s, sys: 5.96 s, total: 2min 54s
Wall time: 47.2 s
Out[19]:
KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
     fit_inverse_transform=True, gamma=10, kernel='rbf',
     kernel_params=None, max_iter=None, n_components=500, n_jobs=1,
     random_state=None, remove_zero_eig=False, tol=0)

Try looking at distributions in PSNR or MSE. Maybe do paired T-test (don't care about assumption checking).

Also, try doing nearest-neighbor classifier for linear and non-linear PCA. Taking linear distance between projections and maybe MSE on boneage differences. Also, attempt to give error based on "Is within a year of each other?"

Surprisingly, the kernel method, which often runs much more slowly than the regular, linear incarnation of PCA, actually executed much faster. Let's check out how well it's doing.

In [20]:
idx = np.random.choice(range(image_data.shape[0]))
plot_reconstruction(image_data, idx, kpca)
Out[20]:
0.6295577672794682

The reconstructions from Kernel PCA have very high fidelity to the original images. And unlike the linear models, there is very little random noise in the background. The metrics that we are testing for indicate that it is

In [21]:
def plot_reconstruction_comparison(data, idx, pca, rpca, kpca):
    orig = data.iloc[idx].reshape(1,-1)
    
    pca_restruct = pca.inverse_transform(pca.transform(orig))
    rpca_restruct = rpca.inverse_transform(rpca.transform(orig))
    kpca_restruct = kpca.inverse_transform(kpca.transform(orig))
    
    fig, ax = plt.subplots(nrows=2, ncols=2)
    
    ax[0][0].imshow(orig.reshape(256,256), cmap=plt.cm.gray)
    ax[0][0].set_title('Original Image')
    ax[0][0].set_xticks(())
    ax[0][0].set_yticks(())

    
    ax[0][1].imshow(pca_restruct.reshape(256,256), cmap=plt.cm.gray)
    ax[0][1].set_title('PCA Reconstruction')
    ax[0][1].set_xticks(())
    ax[0][1].set_yticks(())
    
    ax[1][0].imshow(rpca_restruct.reshape(256,256), cmap=plt.cm.gray)
    ax[1][0].set_title('RPCA Reconstruction')
    ax[1][0].set_xticks(())
    ax[1][0].set_yticks(())

    ax[1][1].imshow(kpca_restruct.reshape(256,256), cmap=plt.cm.gray)
    ax[1][1].set_title('KPCA Reconstruction')
    ax[1][1].set_xticks(())
    ax[1][1].set_yticks(())
    
    plt.tight_layout
In [22]:
ix = np.random.choice(range(image_data.shape[0]))
plot_reconstruction_comparison(image_data, ix, pca, rpca, kpca)

Qualitatively, we can see KPCA consistently outperforming its linear counterparts. It is much less grainy; it more accurately captures the light levels in the image, and its border lines are more well-defined than those in RPCA and full PCA. Additionally, though this isn't visible in every image, it suffers less from the issue that linear PCA does, wherein images whose subjects are in a small window off to the edge will leave shadows of the principle eigenhands in the midddle of the image. Now let's compare them all quantitatively, by building a Nearest-Neighbors classifier and determining accuracy, then running a paired, two-sample t-test to see if KPCA actually outperforms linear PCA.

These cells are off-limits

In [23]:
# restruct_pca = pd.DataFrame(np.zeros(image_data.shape), 
#                             index=image_data.index,
#                             columns=image_data.columns)

# for i in range(image_data.shape[0]):
#     if i%100==0:
#         print(i)
#     restruct_pca.iloc[i] = pca.inverse_transform(
#         pca.transform(image_data.iloc[i].reshape(1,-1)))
    
# pickle.dump(restruct_pca, open(os.path.join(os.pardir, 
#                                             'data',
#                                             'boneage-compressed',
#                                             'restruct_pca.p'),
#                                'wb'))

# restruct_kpca = pd.DataFrame(np.zeros(image_data.shape), 
#                             index=image_data.index,
#                             columns=image_data.columns)

# for i in range(image_data.shape[0]):
#     if i%100==0:
#         print(i)
#     restruct_kpca.iloc[i] = kpca.inverse_transform(
#         kpca.transform(image_data.iloc[i].reshape(1,-1)))
    
# pickle.dump(restruct_kpca, open(os.path.join(os.pardir,
#                                              'data',
#                                              'boneage-compressed',
#                                              'restruct_kcpa.p'), 'wb'))

Start running again

In [13]:
restruct_pca = pickle.load( open(os.path.join(os.pardir, 
                                              'data',
                                              'boneage-compressed',
                                              'restruct_pca.p'),
                                'rb') )

restruct_kpca = pickle.load( open(os.path.join(os.pardir, 
                                               'data',
                                               'boneage-compressed',
                                               'restruct_kpca.p'),
                                 'rb') )

Now that we have the full reconstruction matrices, we have two tasks to perform:

  1. Get the hit rate for a nearest-neighbor algorithm that uses ssim, if we consider a hit to be within 1 year (or 12, as the data is given in months) using both PCA and KPCA
  2. Determine whether or not KPCA performs better than PCA at reconstructing images, to within a statistically significant margin. We will do this based both on their accuracy in part 1 and based on their ssim scores for reconstruction
In [8]:
def struct_sim(data1, data2):#, idx1, idx2):
    return ssim(data1, data2)

def get_sim_matr(data1, data2):

    df_sim = pd.DataFrame(np.zeros((image_data.shape[0],
                                    image_data.shape[0])),
                          index=label_data.id, 
                          columns=label_data.id)
    
    for i in range(data1.shape[0]):
        if i%20 == 0: # gives a feel for how fast things are going
            print(i)
        for j in range(data2.shape[0]):
            df_sim[label_data.id[i]][label_data.id[j]] = \
            struct_sim(data1[i], data2[j])
    return df_sim

def get_sim_vec(data1, data2):
    vec_sim = pd.Series(np.zeros(image_data.shape[0]),
                        index=label_data.id)
    
    for i in range(data1.shape[0]):
        vec_sim[i] = struct_sim(data1[i], data2[i])
    
    return vec_sim
In [153]:
%time pca_vec = get_sim_vec(image_data, restruct_pca)
%time kpca_vec = get_sim_vec(image_data, restruct_kpca)
CPU times: user 2.28 s, sys: 11.7 ms, total: 2.29 s
Wall time: 2.5 s
CPU times: user 2.3 s, sys: 0 ns, total: 2.3 s
Wall time: 2.31 s
In [28]:
# estimate how long this next bit should last with all the data
print('{:.0f} min'.format(2 * 2.68 * 2000 / 60))
179 min

That's quite a long time. Maybe let's try subsampling a little. Say we use a fourth of each dataset.

In [29]:
print('{:.0f} min'.format(2 * 2.68 * 2000 / 60 / 16))
11 min

That's a little nicer. We'll try just using the first 500 entries of each dataset to get the kNN classifier.

In [60]:
def get_sim_matr(data1, data2, method=struct_sim):

    df_sim = pd.DataFrame(np.zeros((image_data.shape[0],
                                    image_data.shape[0])),
                          index=label_data.id, 
                          columns=label_data.id)
    
    for i in range(int(data1.shape[0]/4)): # here, we made changes
        if i%100 == 0:
            print(i) # gives a feel for how fast things are going
        for j in range(int(data2.shape[0]/4)): # here, we made changes
            df_sim[label_data.id[i]][label_data.id[j]] = \
            method(data1[i], data2[j])
    return df_sim
In [61]:
%time pca_nn = get_sim_matr(image_data, restruct_pca)
%time kpca_nn = get_sim_matr(image_data, restruct_kpca)
0
100
200
300
400
CPU times: user 3min 44s, sys: 236 ms, total: 3min 44s
Wall time: 3min 45s
0
100
200
300
400
CPU times: user 3min 44s, sys: 1.21 s, total: 3min 46s
Wall time: 3min 47s

For the purposes of this analysis, we won't check the usual t-test assumptions, as this is mostly a heuristic check anyways.

In [32]:
from scipy.stats import ttest_rel

t, p = ttest_rel(pca_vec, kpca_vec)
print('The two-sample paired t-test gives us a p-value of {:.3f} for our null hypothesis that PCA and KPCA perform as well as each other'.format(p))
The two-sample paired t-test gives us a p-value of 0.000 for our null hypothesis that PCA and KPCA perform as well as each other
In [33]:
print('PCA SSIM scores:\n', 
      pca_vec.head(), 
      '\n------------------\n', 
      'KPCA SSIM scores:\n', 
      kpca_vec.head(),
      '\n------------------\n')
PCA SSIM scores:
 id
1377    0.991617
1378    0.991570
1379    0.991427
1380    0.991308
1381    0.991212
dtype: float64 
------------------
 KPCA SSIM scores:
 id
1377    0.607401
1378    0.607837
1379    0.606604
1380    0.607771
1381    0.606729
dtype: float64 
------------------

Whether or not this is an issue with the particular metric we are using or an actual indication of some level of fidelity that PCA captures but KPCA fails to capture, the t-test and a cursory glance at the actual SSIM scores indicate that, according to this metric, PCA performs radically better than its non-linear cousing.

We can't do much more than conjecture about why this might be. One possible explanation is that SSIM is very sensitive to slight structural diferences, which may arise with KPCA, though it tends to be very robust to random noise, which PCA comes with a lot of. PCA is more likely to get the exact structure correct, even if it looks perceptibly different, whereas KPCA is more likely to give us something that looks like the structure but may have some slight transformations (also, see this paper, which examines SSIM as an alternative to MSE and looks at both of their limitations).

In [62]:
# oops

pca_knn = pd.DataFrame(pca_nn[0:499], 
                       index=label_data.id[0:499], 
                       columns=label_data.id[0:499])
kpca_knn = pd.DataFrame(kpca_nn[0:499], 
                       index=label_data.id[0:499], 
                       columns=label_data.id[0:499])
In [42]:
def is_hit(idx1, idx2):
    return 1 if (label_data.boneage[idx1] - 
            label_data.boneage[idx2]) <= 12 else 0

def get_hit_rate(data, max=True):
    total = 0
    for idx in label_data.id[:data.shape[0]]:
        if max:
            data.loc[idx][idx] = 0
            closest = np.argmax(data.loc[idx])
        else:
            data.loc[idx][idx] = np.inf
            closest = np.argmin(data.loc[idx])
        total += is_hit(idx, closest)
    return total/data.shape[0]
In [186]:
pca_hit_rate = get_hit_rate(pca_knn)
kpca_hit_rate = get_hit_rate(kpca_knn)
In [187]:
print('The nearest neighbor classifier for linear PCA returns an image of age within 12 months {:.2f}% of the time.\n\n'.format(pca_hit_rate*100), 'The nearest neighbor classifier for kernel PCA returns an image of age within 12 months {:.2f}% of the time.'.format(kpca_hit_rate*100))
The nearest neighbor classifier for linear PCA returns an image of age within 12 months 60.92% of the time.

 The nearest neighbor classifier for kernel PCA returns an image of age within 12 months 61.72% of the time.

Despite the earlier strangeness (which may still have to do with unexpected interaction with the metric), kernel PCA seems to do just slightly better on this dataset than linear. Granted, this is alsmost certainly not statistically significant, and it would be very difficult to justify a claim that KPCA is "better" based on the 1% increase in classification using a nearest-neighbor classifier on a single small dataset, but for now, it seems to be performing more strongly.

3.3 Feature Extraction

For initial feature extraction, we will look at the daisy method of using histograms to etimate the gradient at given points throughout the image.

In [77]:
from skimage.feature import daisy
from skimage.io import imshow

def daisy_vis(image, step=10, radius=20, rings=3, histograms=8, orientations=8, visualize=False):
    if visualize:
        desc, img = daisy(image.reshape(256,256), 
                          step=step, 
                          radius=radius, 
                          rings=rings, 
                          histograms=histograms,
                          orientations=orientations,
                          visualize=visualize
                         )
        imshow(img)
        
    else:
        desc = daisy(image.reshape(256,256), 
                     step=step, 
                     radius=radius, 
                     rings=rings, 
                     histograms=histograms,
                     orientations=orientations,
                     visualize=visualize
                    )
    return desc.reshape(-1)

idx = np.random.choice(image_data.shape[0])
dsy = daisy_vis(image_data.iloc[idx], step=100, histograms=6, visualize=True)
/usr/lib64/python3.5/site-packages/skimage/io/_plugins/matplotlib_plugin.py:77: UserWarning:

Float image out of standard range; displaying image with stretched contrast.

Don't run this every time. Just load the data that's been pickled

In [88]:
# dsy_mat = np.apply_along_axis(daisy_vis, 1, image_data, step=25)

# pickle.dump(dsy_mat, open(os.path.join(os.pardir, 'data', 'boneage-compressed', 'daisy_mat.p'), 'wb'))

Start running cells again

In [55]:
dsy_mat = pickle.load(open(os.path.join(os.pardir, 'data', 'boneage-compressed', 'daisy_mat.p'), 'rb'))
In [58]:
dsy_mat = pd.DataFrame(dsy_mat, index=label_data.id)
daisy_nn = get_sim_matr(dsy_mat, dsy_mat)
0
100
200
300
400
In [165]:
daisy_knn = pd.DataFrame(daisy_nn, 
                         index=label_data.id[0:499], 
                         columns=label_data.id[0:499])
daisy_knn.head()
Out[165]:
id 1377 1378 1379 1380 1381 1382 1383 1384 1385 1387 ... 1929 1931 1932 1933 1934 1935 1936 1937 1938 1939
id
1377 1.000000 0.997023 0.994228 0.992101 0.978064 0.986358 0.986227 0.995305 0.990534 0.988728 ... 0.986251 0.987431 0.992237 0.988487 0.968927 0.947781 0.982401 0.987416 0.975490 0.983180
1378 0.997023 1.000000 0.997203 0.991223 0.978610 0.989986 0.988096 0.992116 0.991532 0.994161 ... 0.991844 0.991366 0.993090 0.992450 0.976652 0.954765 0.984662 0.990400 0.982222 0.988047
1379 0.994228 0.997203 1.000000 0.994970 0.976816 0.987401 0.989423 0.994683 0.990844 0.991793 ... 0.987493 0.989382 0.994506 0.989851 0.969578 0.948714 0.984471 0.989423 0.977281 0.985896
1380 0.992101 0.991223 0.994970 1.000000 0.988785 0.990372 0.987399 0.993958 0.990230 0.987021 ... 0.982141 0.984710 0.991561 0.985201 0.962844 0.942599 0.980813 0.986054 0.971999 0.981296
1381 0.978064 0.978610 0.976816 0.988785 1.000000 0.993045 0.976713 0.974781 0.979652 0.978708 ... 0.974771 0.972886 0.972943 0.973878 0.960583 0.939508 0.965799 0.973251 0.967501 0.971590

5 rows × 499 columns

In [104]:
dsy_hit_rate = get_hit_rate(daisy_knn)
print('-------------------------\n',
     'Daisy NN Prediction Rate: \n{:.4f}'.format(dsy_hit_rate),
     '\n-------------------------\n',
     'PCA NN Prediction Rate: \n{:.4f}'.format(pca_hit_rate),
     '\n-------------------------\n',
     'KPCA NN Prediction Rate: \n{:.4f}'.format(kpca_hit_rate))
-------------------------
 Daisy NN Prediction Rate: 
0.6092 
-------------------------
 PCA NN Prediction Rate: 
0.6092 
-------------------------
 KPCA NN Prediction Rate: 
0.6172

All three prediction methods give almost exactly the same prediction rates. This is intriguing and may warrant examination. It may also be the case, once again, that SSIM is not the best metric for our particular application or for use with Daisy. Let's take a quick glance at the similarity matrices to get an understanding of what might be going on. We will use heatmaps to try and get a feel for the state of our data.

In [185]:
sns.heatmap(pca_knn, vmin=0, vmax=1)
Out[185]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4568033780>
In [181]:
sns.heatmap(kpca_knn, vmin=0, vmax=1)
Out[181]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f456860e9e8>
In [182]:
sns.heatmap(daisy_knn, vmin=0, vmax=1)
Out[182]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f456b114048>

The PCA and KPCA similarity scores indicate a troubling pattern in the data, that nearby observations are much more similar than distant observations. In particular, there appear to be two sub-categories of the data, which exhibit very specific trends relative to each other. What is roughly the first half and the secondly half seem to be mostly intra-correlated with each other, with the correllation falling away as the indices get farther apart. The two halves, however, appear to be poorly intercorrelated with each other, further strengthening the idea that the more distant two indices are from each other, the weaker their correlation is.

Needless to say, this is entirely ddue to the construction of the dataset, and not the actual data itself. It holds nothing but confounders and problems for the data scientist.

No indication of this was given with the dataset, and there are no readily perceptible patterns in the boneage or gender data that would indicate that we should expect such a trend. Discounting the possibility of a simple error in code (which is, nonetheless, very possible), we have to conclude that there is some pattern in the ordering of the data which we can't quite see.

Whatever is actually going on, these similarity scores and prediction rates may not be the most accurate and should likely be discounted. Regardless, they were poor to begin with.

Now, let's look at some heatmaps of the similarity data, sorted by age to see more directly what the effects of age on similarity are.

In [166]:
ages = label_data.boneage[0:499]
ages.index = label_data.id[0:499]

def sort_by_ages(data):
    copy_data = deepcopy(data)
    copy_data = copy_data.append(ages)
    copy_data = pd.concat([copy_data, ages], axis=1)
    
    copy_data.sort_values(by='boneage', axis=0, inplace=True)
    copy_data.sort_values(by='boneage', axis=1, inplace=True)
    return copy_data.drop('boneage').drop('boneage', axis=1)

First, we will sort the data based on the age of the subject, so that we can try to detect patterns based on the age of the people involved, as that is ultimately what we are trying to predict.

In [179]:
sorted_daisy = sort_by_ages(daisy_knn)
sns.heatmap(sorted_daisy, vmin=0, vmax=1)
/usr/lib64/python3.5/site-packages/pandas/core/indexes/api.py:87: RuntimeWarning: unorderable types: str() < int(), sort order is undefined for incomparable objects
  result = result.union(other)
Out[179]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f45680e8e10>

This is rather interesting. It tells us that there appear to be a few observations that are very unrelated to most other observations in the dataset, whereas most observations are highly interrelated. We can suppose that this is probably due to poor quality or bad framing of the individual x-rays.

In [183]:
sorted_pca = sort_by_ages(pca_knn)
sns.heatmap(sorted_pca, vmin=0, vmax=1)
/usr/lib64/python3.5/site-packages/pandas/core/indexes/api.py:87: RuntimeWarning: unorderable types: str() < int(), sort order is undefined for incomparable objects
  result = result.union(other)
Out[183]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4569a21588>
In [184]:
sorted_kpca = sort_by_ages(kpca_knn)
sns.heatmap(sorted_kpca, vmin=0, vmax=1)
/usr/lib64/python3.5/site-packages/pandas/core/indexes/api.py:87: RuntimeWarning: unorderable types: str() < int(), sort order is undefined for incomparable objects
  result = result.union(other)
Out[184]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f45698abb38>

There isn't a whole lot to discuss with regards to the similarity data from PCA reconstruction, as the data is all kind of an unordered mess. One of the interesting points here is similar to the one made above, that some datapoints are unusually similar to almost every other datapoint, whereas others are unusually dissimilar to almost every other datapoint.

However, these points of interest seem almost entirely uncorrellated with age, and there don't appear to be any age-localized effects of interest.

4. Local Binary Pattern

4.1 Explanation

In light of the particular data at hand (no pun intended), the extra feature extraction method that we have selected is Local Binary Pattern. This was selected over other methods (e.g. Histogram of Gradients) which were designed to detect particular classes of objects, whereas LBP was intended to distinguish between images of similar objects, of which we have many. It is also designed to be rotation invariant, which is relevant as many of the hands are at different orientations and angles.

LBP was originally designed as a texture recognition algorithm to distinguish between human faces. It uses a thresholded histogram technique similar to that of daisy, but it collects data on each pixel from each adjacent pixel to pick up on "micro-patterns" in faces and other objects at different scales for determining similarity. This weights each pixel, essentially based on how "interesting" the local neighborhood of pixels is, with the threshold for "interesting" being the weight of the pixel itself.

(Information on LBP retrieved from here)

In [23]:
from skimage.feature import local_binary_pattern

def get_bin_pattern(img, visualize=True):
    img_resh = img.reshape((256, 256))
    lbp = local_binary_pattern(img_resh, 8, 2)
    if visualize:
        fig, ax = plt.subplots(nrows=1, ncols=2)
        ax[0].imshow(lbp, cmap=plt.cm.gray)
        ax[0].set_title('Local Binary Pattern')
        ax[1].imshow(image_data.iloc[idx].reshape(256,256),
                     cmap=plt.cm.gray)
        ax[1].set_title('Original Image')
    return lbp.reshape(-1)
    
idx=np.random.choice(image_data.shape[0])
lbp = get_bin_pattern(image_data.iloc[idx])
plt.imshow(image_data.iloc[idx].reshape(256,256), cmap=plt.cm.gray)
Out[23]:
<matplotlib.image.AxesImage at 0x7f45740e40f0>

From the visualizations above, we can see roughly what LBP is doing. High weights are given to areas of high gradient, whereas areas that are mostly flat are somewhat gray and noisy. Let's now see how this performs in a prediction task using Kullback-Leibler Divergence for a Nearest-Neighbor classifier.

We choose to use KL because transfromation into LBP representations retain structure, so we should expect that those classifications would almost identically mirror running an SSIM nearest-neighbor classifier on the original data, so we need some other metric. This metric is chosen for use in the documentation, so we will try using it here and see what it looks like.

4.2 Results and Discussion

In [27]:
from copy import deepcopy

img_data_subsample = deepcopy(image_data)

lbp_matr = img_data_subsample.apply(get_bin_pattern, axis=1, visualize=False)
In [53]:
from sklearn.metrics import mutual_info_score # inverted KL Divergence

lbp_nn = get_sim_matr(lbp_matr, lbp_matr, 
                      method=mutual_info_score)
0
100
200
300
400
In [54]:
lbp_knn = pd.DataFrame(lbp_nn[0:499],
                       index=label_data.id[0:499], 
                       columns=label_data.id[0:499])
lbp_score = get_hit_rate(lbp_knn)

print('Local Binary Pattern\'s prediction rate is {:.3f}.'.format(lbp_score))
Local Binary Pattern's prediction rate is 0.589.

Well, that's rather abyssmal. It's hard to say exactly, but we can conjecture as to why this might have failed as miserably as it has. In particular, LBP was specifically designed for use on human faces, so it may be poorly optimized for use on other images. Additionally, the LBP data is quite noisy, especially in the background and in blank space, which can very quickly drive up the KL divergence and sweep away the relevant parts of the data, which are more than likely localized to a few bone and joint structures.